Abstract

For latent class models where the class weights depend on individual covariates, 
we derive a simple expression for computing the score vector and a convenient 
hybrid between the observed and the expected information matrices which is 
always positive definite. These ingredients, combined with a maximization
algorithm based on line search, provide an efficient tool for maximum likelihood
estimation. In particular, the proposed algorithm is such that the log-likelihood
never decreases from one step to the next and the choice of starting values is 
not crucial for reaching a local maximum. We show how the same algorithm 
may be used for numerical investigation of the effect of model misspecifications.
An application to education transmission is used as an illustration. 

Keywords: Latent class models, individual covariates, Fisher-scoring 
algorithm, line search. 



1. Introduction 

The latent class models considered in this paper are those where subjects 
belong to one among a finite set of disjoint latent classes with probabilities which 
may depend on individual covariates. Observations are based on a collection 
of discrete responses whose distribution depends on the latent type but not on
covariates. The literature on latent class models is very extensive; see Vermunt
(2010) for a convenient selection of some of the most relevant contributions; a
slightly more extended framework, dealing with missing data and known groups
of distinct respondents, is presented by Chung (2003).



The EM (expectation-maximization) algorithm is generally used to compute
the maximum likelihood estimates, though, for instance, the Latent GOLD
software combines EM and Newton-Raphson. As regards the EM algorithm, its
numerical stability and the fact that the likelihood always increases from one
step to the next are mentioned as its main advantages. The Newton-Raphson
algorithm, though faster, is known for being likely to diverge unless the
starting values are close to a local maximum; in particular, in the context of
latent class models, the algorithm cannot be used safely on its own. The
performance of the Newton-Raphson or Fisher-scoring algorithms may be greatly
improved by performing a line search to optimize the step length (see for example
Potra and Shi, 1995; Turner, 2008) and adopting suitable strategies that prevent
the likelihood from decreasing. Bolck et al. (2004) have proposed a three-step
algorithm which, in the first step, estimates a latent class model without
covariates, then assigns subjects to latent classes and, finally, estimates the
latent regression model with weights derived from the estimated matrix of
classification errors. An extension of this approach is proposed by Vermunt (2010).

In this paper we propose a convenient matrix formulation that allows us to derive
simple expressions for computing the score and the observed or the expected
information matrix. The expected information matrix has the advantage of being
always positive definite whenever the model is identifiable; on the other hand,
it has been argued (see Efron and Hinkley, 1978) that the observed information
matrix is preferable for the asymptotic distribution of the maximum likelihood
estimates because it is data dependent. We show that there is a component
of the observed information which is easier to compute, is always positive
definite and is such that its expectation is still equal to the expected
information; we suggest using this hybrid information matrix in the maximization
algorithm. In addition we describe the main features of software which combines
line search and strategies to prevent the likelihood from decreasing. With a
minor modification, the same algorithm may be used to maximize the expected
log-likelihood; this could be used as a numerical tool to assess consistency of
estimates under possible misspecifications of the model, when theoretical results
are not easily available.

In section 2, after introducing the notations, we derive an expression for the 
score and an approximation of the information matrix which, we show, is positive 
definite, under suitable conditions. In section 3 we discuss the computation of 
the previous quantities and describe a suitable line search algorithm; in addition 
we show how the same algorithm may be used for numerical assessment of the 
effect of model misspecifications. In section 4 we present an application from the
field of education transmission. 



2. Notations and main results 

Suppose there are $c$ disjoint latent classes and let $\pi_i$, $i = 1, \dots, n$, be the
vector of prior probabilities for the $i$th subject to belong to one of the $c$ latent
classes; let $X_i$ be a $c \times k$ matrix depending on individual covariates and assume
that

$$\pi_i = \frac{\exp(X_i \beta)}{1_c' \exp(X_i \beta)}.$$
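
As an illustration only, the class weights for one subject might be computed as follows. This is a minimal sketch assuming NumPy; the function name `class_weights`, the layout of `X_i` (class 0 as reference, an intercept and one slope for each other class) and all numerical values are hypothetical choices, not taken from the paper.

```python
import numpy as np

def class_weights(X_i, beta):
    """Prior class probabilities pi_i = exp(X_i beta) / (1_c' exp(X_i beta))."""
    eta = X_i @ beta
    w = np.exp(eta - eta.max())   # subtracting the max leaves the ratio unchanged
    return w / w.sum()

# Hypothetical subject: c = 3 classes, one covariate x, class 0 as reference.
# Columns of X_i: intercept and slope for class 1, then for class 2 (k = 4).
x = 0.7
X_i = np.array([[0.0, 0.0, 0.0, 0.0],
                [1.0, x,   0.0, 0.0],
                [0.0, 0.0, 1.0, x  ]])
beta = np.array([0.5, 1.0, -0.2, 0.3])
pi_i = class_weights(X_i, beta)
print(pi_i)
```

With this layout the log-odds of class 1 against class 0 equal the first intercept plus the first slope times `x`, which is the multinomial logistic form used later in the application.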

Let $r$ be the number of possible configurations of the response variables; their
joint distribution conditional on $U = j$, $j = 0, \dots, c-1$, may be represented by
the $r \times 1$ vector of probabilities

$$q_j = \frac{\exp(G \theta_j)}{1_r' \exp(G \theta_j)},$$

where $G$ is an $r \times g$ full rank design matrix and $\theta_j$ a suitable vector of
log-linear parameters. The dimension $r$ of $q_j$ is equal to the product of the numbers
of categories of the response variables, with entries in a given lexicographic order.
This formulation does not necessarily assume conditional independence among
the responses; the conditional dependence structure is determined by the $G$
matrix and we assume that this is such that the model is identifiable. Finally,
let $Q$ be the matrix whose $j$th column is $q_j$, so that $p_i = Q \pi_i$ is the marginal
distribution of the responses. If we stack the vectors $\theta_j$ one below the other
into the vector $\theta$, the contribution of the $i$th subject to the log-likelihood may
be written as $\ell(\beta, \theta; y_i, X_i) = y_i' \log(p_i)$.
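
The quantities just defined can be assembled in a few lines. The sketch below assumes NumPy; the helper names, the toy design matrix `G` (two binary responses under conditional independence) and the parameter values are hypothetical and serve only to show how $Q$, $p_i$ and the log-likelihood contribution fit together.

```python
import numpy as np

def normalize_exp(v):
    """exp(v) / (1' exp(v)), computed stably."""
    w = np.exp(v - v.max())
    return w / w.sum()

def loglik_contribution(beta, theta, y_i, X_i, G):
    """Return y_i' log(p_i) and p_i = Q pi_i; row j of theta holds theta_j."""
    pi_i = normalize_exp(X_i @ beta)
    Q = np.column_stack([normalize_exp(G @ th) for th in theta])  # r x c
    p_i = Q @ pi_i
    return float(y_i @ np.log(p_i)), p_i

# Hypothetical setup: two binary responses (r = 4 cells in lexicographic order),
# conditional independence (G holds the two main effects, g = 2), c = 2 classes.
G = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
X_i = np.array([[0.0, 0.0], [1.0, 0.4]])      # intercept and one covariate
beta = np.array([0.2, -1.0])
theta = np.array([[-1.0, -0.5], [1.5, 0.8]])  # theta_0 and theta_1
y_i = np.array([0.0, 0.0, 1.0, 0.0])          # the third configuration was observed
ll_i, p_i = loglik_contribution(beta, theta, y_i, X_i, G)
print(ll_i, p_i)
```

Because `y_i` is an indicator vector here, the contribution reduces to the log of a single entry of `p_i`.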

2.1. Score vector and information matrix 

Under the assumption that observations from different subjects are independent,
the score and the information may be written as sums across subjects. By
application of the chain rule, the score relative to $\beta$ is

$$s_\beta = \sum_i X_i' \Omega_{\pi_i} Q' \mathrm{diag}(p_i)^{-1} y_i,$$

where $\Omega_{\pi_i} = \mathrm{diag}(\pi_i) - \pi_i \pi_i'$. Noting that $p_i = \sum_j \pi_{ij} q_j$, by the chain rule,
the score relative to $\theta_j$ is

$$s_j = \sum_i \pi_{ij} G' \Omega_j \mathrm{diag}(p_i)^{-1} y_i,$$

where $\Omega_j = \mathrm{diag}(q_j) - q_j q_j'$.
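
One way to gain confidence in these expressions is to compare them with a numerical derivative of the log-likelihood. The following sketch assumes NumPy; the function names, the toy dimensions and the randomly generated data are hypothetical and are not the author's implementation.

```python
import numpy as np

def normalize_exp(v):
    w = np.exp(v - v.max())
    return w / w.sum()

def pieces(psi, X_i, G, k):
    """Split psi into (beta, theta) and return pi_i, Q, p_i for one subject."""
    c, g = X_i.shape[0], G.shape[1]
    beta, theta = psi[:k], psi[k:].reshape(c, g)
    pi = normalize_exp(X_i @ beta)
    Q = np.column_stack([normalize_exp(G @ th) for th in theta])
    return pi, Q, Q @ pi

def loglik(psi, Y, X, G, k):
    return sum(float(y @ np.log(pieces(psi, Xi, G, k)[2])) for y, Xi in zip(Y, X))

def score(psi, Y, X, G, k):
    c, g = X[0].shape[0], G.shape[1]
    s_beta, s_theta = np.zeros(k), np.zeros((c, g))
    for y, Xi in zip(Y, X):
        pi, Q, p = pieces(psi, Xi, G, k)
        v = y / p                                    # diag(p_i)^{-1} y_i
        omega_pi = np.diag(pi) - np.outer(pi, pi)
        s_beta += Xi.T @ omega_pi @ Q.T @ v
        for j in range(c):
            omega_j = np.diag(Q[:, j]) - np.outer(Q[:, j], Q[:, j])
            s_theta[j] += pi[j] * (G.T @ omega_j @ v)
    return np.concatenate([s_beta, s_theta.ravel()])

# Hypothetical toy data: n = 6 subjects, c = 2, k = 2, g = 2, r = 4.
rng = np.random.default_rng(0)
G = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
X = [np.array([[0.0, 0.0], [1.0, x]]) for x in rng.normal(size=6)]
Y = [np.eye(4)[u] for u in rng.integers(0, 4, size=6)]
psi = rng.normal(size=2 + 2 * 2)

# Central finite differences as an independent check of the analytic score.
eps = 1e-6
numeric = np.array([
    (loglik(psi + eps * e, Y, X, G, 2) - loglik(psi - eps * e, Y, X, G, 2)) / (2 * eps)
    for e in np.eye(psi.size)
])
max_abs_error = float(np.max(np.abs(score(psi, Y, X, G, 2) - numeric)))
print(max_abs_error)
```

The explicit $\Omega$ matrices are formed here only for readability; for large $r$ one would avoid building them.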

It is convenient to think of the observed information matrix as made of two
components: call $F$ the matrix which we obtain by treating the score vector
as a function of $\mathrm{diag}(p_i)$ while all the rest is held constant and call $D$ the
matrix which we obtain by differentiating the score vector while $\mathrm{diag}(p_i)$ is held
constant. Let

$$A_i = (Q \Omega_{\pi_i} X_i \;\; \pi_{i1} \Omega_1 G \;\; \dots \;\; \pi_{ic} \Omega_c G)$$

and $\mathrm{diag}(d_i) = \mathrm{diag}(p_i)^{-2} \mathrm{diag}(y_i)$.

Lemma 1. The matrix $F$ is equal to $\sum_i A_i' \mathrm{diag}(d_i) A_i$ and $E(D) = 0$.

Proof. See the Appendix. □ 

When individual observations are available, $y_i = e_{u(i)}$, a vector of 0's except
for the $u(i)$th entry which is 1; let $t_i = A_i' e_{u(i)}$, let $T$ be the $n \times (k + cg)$ matrix
with rows $t_i'$; let $\dot p_i$ be the $u(i)$th element of $p_i$ and $\dot p$ the vector with elements
$\dot p_i$.

Lemma 2. The hybrid information matrix $F$ is positive definite if and only if
$T$ is of full rank $k + cg$.

Proof. The result follows because, by simple algebra, $F = \sum_i t_i t_i' / \dot p_i^2 = T' \mathrm{diag}(\dot p)^{-2} T$. □
In practice, $F$ is positive definite whenever the $n$ observations contain more
than $k + cg$ distinct patterns of covariate configurations; in that case
the model is identifiable.

The result of Lemma 2 seems to suggest that F may be used as an approxi- 
mation for both the observed and the expected information matrix. Relative to 
the observed information, it has the advantage that it is positive definite like the
expected information. Relative to the expected information, as we show below, it
is more easily computed and, in addition, it is partly data dependent.



3. Computational aspects 

First we note that the whole score vector may be computed as

$$s = \sum_i A_i' \mathrm{diag}(p_i)^{-1} y_i.$$

Though the $A_i$ matrices involve, apparently, several matrix multiplications, as
we show below, they need not be computed explicitly. Each $t_i$ vector may be
constructed by stacking one below the other the following components, where
$\dot q_i = Q' e_{u(i)}$ is the $u(i)$th column of $Q'$ and $\dot q_{ij}$ is its $j$th element,

$$X_i' \Omega_{\pi_i} Q' e_{u(i)} = X_i' \mathrm{diag}(\pi_i) \dot q_i - X_i' \pi_i \pi_i' \dot q_i$$

and, for $j = 0, \dots, c-1$,

$$\pi_{ij} G' \Omega_j e_{u(i)} = \pi_{ij} (g_i \dot q_{ij} - G' q_j \dot q_{ij}),$$

where $g_i$ is the $u(i)$th column of $G'$.
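
The stacking just described can be checked against the explicit construction of $A_i$. The sketch below assumes NumPy; function names, dimensions and the random inputs are hypothetical. It also forms the hybrid matrix as a weighted cross-product of the rows $t_i'$, with weights equal to the inverse squared probability of the observed cell.

```python
import numpy as np

def normalize_exp(v):
    w = np.exp(v - v.max())
    return w / w.sum()

def t_shortcut(u, X_i, pi, Q, G):
    """t_i = A_i' e_u assembled from small pieces, without forming A_i."""
    q_dot = Q[u, :]                                   # u-th column of Q'
    top = X_i.T @ (pi * q_dot) - (X_i.T @ pi) * (pi @ q_dot)
    g_u = G[u, :]                                     # u-th column of G'
    blocks = [pi[j] * q_dot[j] * (g_u - G.T @ Q[:, j]) for j in range(pi.size)]
    return np.concatenate([top] + blocks)

def t_explicit(u, X_i, pi, Q, G):
    """Same vector obtained by building A_i in full (for checking only)."""
    omega_pi = np.diag(pi) - np.outer(pi, pi)
    cols = [Q @ omega_pi @ X_i]
    for j in range(pi.size):
        omega_j = np.diag(Q[:, j]) - np.outer(Q[:, j], Q[:, j])
        cols.append(pi[j] * omega_j @ G)
    return np.hstack(cols)[u, :]                      # A_i' e_u is row u of A_i

# Hypothetical toy inputs: c = 2, k = 2, g = 2, r = 4, n = 8.
rng = np.random.default_rng(1)
G = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
beta, theta = rng.normal(size=2), rng.normal(size=(2, 2))
Q = np.column_stack([normalize_exp(G @ th) for th in theta])

rows, p_dot, gap = [], [], 0.0
for _ in range(8):
    X_i = np.array([[0.0, 0.0], [1.0, rng.normal()]])
    u = int(rng.integers(0, 4))
    pi = normalize_exp(X_i @ beta)
    t = t_shortcut(u, X_i, pi, Q, G)
    gap = max(gap, float(np.max(np.abs(t - t_explicit(u, X_i, pi, Q, G)))))
    rows.append(t)
    p_dot.append((Q @ pi)[u])                         # probability of the observed cell

T = np.array(rows)
F = T.T @ np.diag(np.array(p_dot) ** -2.0) @ T        # weighted cross-product of the t_i
min_eig = float(np.linalg.eigvalsh(F).min())
print(gap, min_eig)
```

In this toy setup nothing guarantees that `T` has full column rank, so `F` is only known to be positive semidefinite; the smallest eigenvalue is printed to make that visible.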



3.1. Line search 

Let $\psi$ be the vector obtained by stacking $\beta$ and $\theta$ one below the other; after
$h - 1$ steps, the basic updating equation takes the form

$$\psi^{(h)} = \psi^{(h-1)} + a_{h-1} \left(F^{(h-1)}\right)^{-1} s^{(h-1)},$$

where $a_{h-1}$ is the step length. When the log-likelihood is not concave and may
have two or more local maxima, an algorithm with $a_{h-1} = 1$ is almost certain
to diverge, unless the starting value is very close to a local maximum; one
possibility would be to set $a_0$ very small and let it increase with $h$. In a related
context, Turner (2008) suggests using the Levenberg-Marquardt algorithm which
combines Newton-Raphson and steepest ascent steps; this would be less efficient
in our context where the information matrix is positive definite. Our algorithm
uses a proper line search where the log-likelihood is never allowed to decrease.
Its main features are given below:



1. set $a_0$ to some value possibly smaller than 1, say, 0.5;
2. at the $(h-1)$th step, first use the updating equation to compute the first
guess, say, $\tilde\psi^{(h,a)}$;
3. compute the log-likelihood and the score at the first guess;
4. with these elements find the step length that maximizes a cubic
approximation to the log-likelihood, let $\tilde\psi^{(h,b)}$ be the second guess;
5. compute the log-likelihood at $\tilde\psi^{(h,b)}$ and select the best guess;
6. in case of no improvement, first shorten the step and, if even this does not
work, perform a steepest ascent step.

A few other adjustments are made to check whether the log-likelihood is locally
concave or that the second derivative is negative along the given direction in 
order to perform some conditional adjustments to the step length. In any case, 
a starting point is never updated unless a better one has been found. 
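
The six features above can be sketched in code. This is a simplified illustration assuming NumPy, not the author's software: the function names are hypothetical, the rule that doubles the trial step after each success is an assumption, the additional concavity checks are omitted, and the demonstration uses a toy Poisson-type log-likelihood rather than a latent class model. The cubic step uses the standard interpolation formula based on values and slopes at two points.

```python
import numpy as np

def cubic_step(f0, d0, f1, d1, a1):
    """Step maximizing the cubic that matches value and slope at steps 0 and a1.
    Returns None when the cubic has no usable maximizer."""
    p0, p1, q0, q1 = -f0, -f1, -d0, -d1               # work with -f and minimize
    e1 = q0 + q1 + 3.0 * (p0 - p1) / a1
    disc = e1 * e1 - q0 * q1
    if not np.isfinite(disc) or disc < 0:
        return None
    e2 = np.sqrt(disc)
    denom = q1 - q0 + 2.0 * e2
    if denom == 0 or not np.isfinite(denom):
        return None
    a = a1 - a1 * (q1 + e2 - e1) / denom
    return a if np.isfinite(a) and a > 0 else None

def better(loglik, psi, step, d, best):
    """Return the better of `best` = (value, point) and the point psi + step * d."""
    cand = psi + step * d
    f = loglik(cand)
    return (f, cand) if np.isfinite(f) and f > best[0] else best

def scoring_with_line_search(loglik, score, info, psi, a0=0.5, max_iter=200, tol=1e-8):
    f, a = loglik(psi), a0
    history = [f]
    for _ in range(max_iter):
        s = score(psi)
        if np.linalg.norm(s) < tol:
            break
        d = np.linalg.solve(info(psi), s)             # scoring direction
        cand_a = psi + a * d                          # first guess
        f_a = loglik(cand_a)
        best = (f_a, cand_a) if np.isfinite(f_a) and f_a > f else (f, psi)
        if np.isfinite(f_a):                          # second guess from the cubic
            a_b = cubic_step(f, float(s @ d), f_a, float(score(cand_a) @ d), a)
            if a_b is not None:
                best = better(loglik, psi, a_b, d, best)
        step = a
        while best[0] <= f and step > 1e-10:          # no improvement: shorten the step
            step /= 2.0
            best = better(loglik, psi, step, d, best)
        step = 1.0
        while best[0] <= f and step > 1e-12:          # last resort: steepest ascent
            best = better(loglik, psi, step, s, best)
            step /= 2.0
        if best[0] <= f:                              # never move to a worse point
            break
        f, psi = best
        history.append(f)
        a = min(1.0, 2.0 * a)                         # assumed rule: let the step grow
    return psi, history

# Toy problem: maximize sum(y * psi - exp(psi)); the maximizer is log(y).
y = np.array([2.0, 5.0, 0.5])
toy_loglik = lambda p: float(y @ p - np.exp(p).sum())
toy_score = lambda p: y - np.exp(p)
toy_info = lambda p: np.diag(np.exp(p))
psi_hat, history = scoring_with_line_search(toy_loglik, toy_score, toy_info, np.full(3, -4.0))
print(psi_hat, np.log(y))
```

Starting far from the maximum, a full scoring step on this toy problem would overshoot badly; the recorded `history` shows that the accepted points nonetheless form an increasing sequence of log-likelihood values.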

In order to increase the probability of reaching a global maximum, after
convergence, a random perturbation is applied to the estimates and the algorithm
is restarted a few times.

3.2. Numerical assessment of the effect of misspecifications

We now show how the expressions for the score and the information matrix
may be used to assess the effect of model misspecifications. Suppose that $M$
is the true model and $M_*$ is a misspecified model; misspecifications may concern
the number of latent classes, the regression model determined by $X_i$ or the
dependence structure of responses encoded into the $G$ matrix.

Suppose, for simplicity, that the $n$ covariate configurations are kept fixed
while the number $m$ of the replicates $y_{ia}$, corresponding to each configuration,
increases. Then, the law of large numbers may be used to show that the average
log-likelihood function converges to its expected value under the true model. Thus,
if we want to assess the effect that a given misspecification of the model has
on the estimates of the parameters of the misspecified model when no theory is
easily available, we simply maximize the appropriate expected log-likelihood.

This may be easily performed by the same adjusted Fisher-scoring algorithm
described above: we may use the expressions for the score vector and information
matrix and simply replace the observations $y_i$ with their expected value under
the true model. The only difference is that the simplified expressions described
above can no longer be used. On the other hand, based on our experience,
the expected log-likelihood seems to be very well behaved so that convergence is
usually reached in very few steps.
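
To make the replacement explicit, write $p_i^0$ for the expected value of $y_i$ under the true model; this symbol is introduced here only for illustration. Under that notation, substituting $p_i^0$ for $y_i$ in the log-likelihood, in the whole score vector and in the matrix of Lemma 1 would give, as a sketch:

```latex
\bar\ell(\psi) = \sum_i p_i^{0\prime} \log p_i(\psi), \qquad
\bar s(\psi) = \sum_i A_i' \, \mathrm{diag}(p_i)^{-1} p_i^0, \qquad
\bar F(\psi) = \sum_i A_i' \, \mathrm{diag}(p_i)^{-2} \mathrm{diag}(p_i^0) A_i .
```

Since $p_i^0$ is in general not an indicator vector, each term involves the whole matrix $A_i$ rather than a single one of its rows, which is why the simplified expressions no longer apply.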

4. Application 

4.1. The data

We use data from the National Child Development Study (NCDS), a UK
cohort study targeting all the population born in the UK between the 3rd and
the 9th of March 1958. Information on family background and on schooling
and social achievements for the subjects in the sample was collected at
different stages of their lives. In the application below we use, as covariates, the
number of years of education and the amount of concern for the child's
education (as graded by the teachers), separately for father and mother. As
response variables we consider the performance in mathematics and reading test
scores taken when the child was 7, 11 and 16 years old, an overall measure
of non cognitive attitudes (as reported by teachers) and the academic
qualification achieved (none, O-level, A-level, university degree). Overall we use 8
responses; all, except for academic qualification, were coded into three categories
based on quantiles. A complete description of the original data is available at
http://www.esds.ac.uk/longitudinal/access/ncds.

4.2. The model

The vector of prior weights $\pi_i$ was assumed to depend on the four covariates
(education and interest for each parent) as in a multinomial logistic regression;
this requires $4(c - 1)$ regression parameters and $c - 1$ logit intercepts. The
response variables were assumed conditionally independent, except for a first
order autoregressive model within Math and Read test scores taken at adjacent
dates; because each of these variables has three ordered categories, to use a
parsimonious model, in place of the 4 interactions, we used a vector of scores
with values 1, 0.5 and 0 according to whether the categories of the two response
variables (say Math at 16 and Math at 11) were equal, differed by 1 or by 2.

For simplicity, in this application we restrict attention only to the 2568 
females with no missing data for the selected variables. Because the relative size 
of the selected sub-sample is slightly less than 30%, results should be interpreted 
with care. Similar models with a number of latent classes ranging from 2 to 5
were fitted and the Bayesian information criterion (BIC) was used to determine
that the model with four latent classes was the most adequate.



Table 1: Bayesian information criterion

c      2        3        4        5
BIC    37165    36577    36507    36592

4.3. Main results

The estimated regression coefficients and z ratios for the logits of belonging
to the different latent classes relative to the first are displayed in Table 2.
All regression coefficients are positive and most are also significant; this seems
to suggest that the first latent class contains subjects with the lowest cognitive
abilities and that the parents' concern, probably associated with their pressure, is
important in pushing children up. The education of the father seems to have a
positive significant effect most of the time, not so the education of the mother;
however, father education might be a proxy for family income.
Table 2: Estimated regression coefficients for the latent weights

         U = 1 / U = 0       U = 2 / U = 0       U = 3 / U = 0
Model    $\beta$    z        $\beta$    z        $\beta$    z
Int.     0.962      6.40     0.518      5.16     1.317      8.34
F.Ed.    1.239      8.87     0.135      1.34     2.001      10.02
M.Ed.    0.028      0.16     0.145      1.53     0.388      2.44
F.In.    0.296      2.85     0.385      3.9      0.831      5.70
M.In.    0.334      3.52     0.607      3.72     1.239      6.53

To characterize the nature of the latent classes better, we display the
conditional distributions of the academic qualification and that of the non cognitive
score tests in Table 3. It emerges that academic qualifications and latent classes
are stochastically ordered; thus, relative to this response, classes are ordered
from worst to best. Instead, relative to non cognitive tests, latent classes are in
reverse order and this seems to indicate that non cognitive scores are probably
a measure of problematic behaviour. A similar picture emerges from Table 4:
essentially, subjects in latent class 3 are the most talented both in Math and
Read.

Table 3: Conditional distributions of academic qualification and non cognitive scores

       Academic qual.                          Non cognitive
U      None      O-lev     A-lev     Univ      0         1         2
0      0.9584    0.0328    0.0079    0.0009    0.0534    0.2027    0.7439
1      0.6198    0.3027    0.0038    0.0736    0.2443    0.3814    0.3743
2      0.1908    0.5579    0.0830    0.1683    0.4344    0.3845    0.1811
3      0.0611    0.2243    0.2225    0.4920    0.5644    0.3131    0.1225

Table 4: Estimated conditional distribution of Math and Read scores at the age of 16

       Math                            Reading
U      0         1         2           0         1         2
0      0.8720    0.1280    0.0000      0.8982    0.1018    0.0000
1      0.5703    0.4029    0.0268      0.5023    0.4519    0.0459
2      0.1285    0.5475    0.3239      0.0736    0.5547    0.3717
3      0.0040    0.0653    0.9307      0.0014    0.1403    0.8583
Appendix

Proof of Lemma 1 

Information relative to $\beta$:

to differentiate $s_\beta$ with $\Omega_{\pi_i}$ held constant, use the properties of the inverse and
diagonal operators to differentiate with respect to the $j$th element of $\pi_i$ which
gives

$$-X_i' \Omega_{\pi_i} Q' \mathrm{diag}(p_i)^{-2} \mathrm{diag}(y_i) q_j;$$

the result follows by stacking these vectors one to the side of the other and
then applying the chain rule. When $\mathrm{diag}(p_i)^{-1}$ is held fixed, let $v_i = Q' \mathrm{diag}(p_i)^{-1} y_i$,
then compute the derivative with respect to $\pi_i'$ and apply the chain rule to
obtain

$$X_i' [\mathrm{diag}(v_i) - (\pi_i' v_i) I - \pi_i v_i'] \Omega_{\pi_i} X_i.$$

To show that the expected value of this expression is 0 note that $E(v_i) = Q' 1_r
= 1_c$, $\mathrm{diag}[E(v_i)] = I$, the identity matrix, and that $1_c' \Omega_{\pi_i} = 0'$.

Information relative to $\theta$:

The derivative of $s_j$ with $\Omega_j$ held constant may be computed as above giving
terms of the form

$$-\pi_{ij} \pi_{ih} G' \Omega_j \mathrm{diag}(p_i)^{-2} \mathrm{diag}(y_i) \Omega_h G.$$

Let $v_i = \mathrm{diag}(p_i)^{-1} y_i$ and $g_h$ the $h$th column of $G'$; to compute the derivative
with $v_i$ held fixed, first differentiate with respect to the elements of $q_j$ and then
use the chain rule to get

$$\pi_{ij} G' (\mathrm{diag}(v_i) - (q_j' v_i) I - q_j v_i') \Omega_j G.$$

Because $E(v_i) = 1_r$, this expression has expectation 0.

The mixed information:

In practice it is convenient to differentiate each $s_j$ with respect to $\beta'$. With
techniques similar to those used above, the component where the initial $\pi_{ij}$ is
held fixed is

$$-\pi_{ij} G' \Omega_j \mathrm{diag}(p_i)^{-2} \mathrm{diag}(y_i) Q \Omega_{\pi_i} X_i,$$

and the other component is simply

$$s_j (x_j' - \pi_i' X_i),$$

with $x_j'$ the $j$th row of $X_i$; this has expectation 0 because $E(s_j) = 0$.